Linear Regression
# Read in data
wine = read.csv("C:/Users/pySag/Desktop/MITx15.071x/Datasets/wine.csv")
str(wine)
## 'data.frame': 25 obs. of 7 variables:
## $ Year : int 1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
## $ Price : num 7.5 8.04 7.69 6.98 6.78 ...
## $ WinterRain : int 600 690 502 420 582 485 763 830 697 608 ...
## $ AGST : num 17.1 16.7 17.1 16.1 16.4 ...
## $ HarvestRain: int 160 80 130 110 187 187 290 38 52 155 ...
## $ Age : int 31 30 28 26 25 24 23 22 21 20 ...
## $ FrancePop : num 43184 43495 44218 45152 45654 ...
# 1. Year gives the year the wine was produced, a unique identifier for each observation.
# 2. Price ---> the dependent variable we are trying to predict.
# 3. Independent variables := WinterRain, AGST, HarvestRain, Age & FrancePop.
# 3.1 We'll use these to predict the price.
# Linear Regression (one variable)
summary(wine)
## Year Price WinterRain AGST
## Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98
## 1st Qu.:1960 1st Qu.:6.519 1st Qu.:536.0 1st Qu.:16.20
## Median :1966 Median :7.121 Median :600.0 Median :16.53
## Mean :1966 Mean :7.067 Mean :605.3 Mean :16.51
## 3rd Qu.:1972 3rd Qu.:7.495 3rd Qu.:697.0 3rd Qu.:17.07
## Max. :1978 Max. :8.494 Max. :830.0 Max. :17.65
## HarvestRain Age FrancePop
## Min. : 38.0 Min. : 5.0 Min. :43184
## 1st Qu.: 89.0 1st Qu.:11.0 1st Qu.:46584
## Median :130.0 Median :17.0 Median :50255
## Mean :148.6 Mean :17.2 Mean :49694
## 3rd Qu.:187.0 3rd Qu.:23.0 3rd Qu.:52894
## Max. :292.0 Max. :31.0 Max. :54602
# The code below builds a one-variable linear regression model,
# using AGST as our single independent variable to predict Price.
# We will use lm(), offered by R; lm stands for "linear model".
# This is the workhorse function for creating linear regression models.
# 1st argument : formula,
# 2nd argument : data frame, in this case our "wine" data frame.
model1 = lm(Price ~ AGST, data=wine)
summary(model1)
##
## Call:
## lm(formula = Price ~ AGST, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78450 -0.23882 -0.03727 0.38992 0.90318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4178 2.4935 -1.371 0.183710
## AGST 0.6351 0.1509 4.208 0.000335 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared: 0.435, Adjusted R-squared: 0.4105
## F-statistic: 17.71 on 1 and 23 DF, p-value: 0.000335
# Let's break it down:
# Call : a description of our model.  Residuals ~ errors.
# Coefficients : a description of the coefficients of our model.
# 1st row : the intercept term.
# 2nd row : the independent variable, AGST, that we passed to our linear model.
# 1st col : Estimate, gives the estimates of the "beta" values of our model.
# More specifically, beta0 (beta naught), the coefficient of the intercept term:
# beta0 := -3.4
# beta1 := 0.635 ; the coefficient of our independent variable (AGST).
# The 3rd-to-last output line is the one we are interested in:
# "Residual standard error" := 0.4993 on 23 d.o.f. (degrees of freedom).
# 2nd-to-last output : Multiple R-squared := 0.435
# & Adjusted R-squared := 0.4105
# Adjusted R-squared adjusts the R-squared value to account for the number of
# independent variables used relative to the number of data points.
# [Note1] : Multiple R-squared will always increase if we add more independent
# variables, but Adjusted R-squared will decrease if we add a variable
# that's of no use to the model.
# [Note2] : This is a good way to determine whether an additional variable
# should be included in the model or not.
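As a sanity check, the adjustment described above can be reproduced by hand. This is a minimal sketch using the values reported by summary(model1): n is the number of observations and k the number of independent variables.

```r
# Sketch: recomputing Adjusted R-squared from Multiple R-squared.
# Values below are taken from the summary(model1) output above.
R2 <- 0.435   # Multiple R-squared
n  <- 25      # number of observations in the wine data frame
k  <- 1       # number of independent variables (AGST only)
adjR2 <- 1 - (1 - R2) * (n - 1) / (n - k - 1)
round(adjR2, 4)   # ~0.4104; summary() reports 0.4105, the small gap is
                  # because R2 itself is rounded in the printed output
```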
# (Sum of Squared Errors)
# Our residuals (errors) are stored in the vector model1$residuals.
model1$residuals
## 1 2 3 4 5 6
## 0.04204258 0.82983774 0.21169394 0.15609432 -0.23119140 0.38991701
## 7 8 9 10 11 12
## -0.48959140 0.90318115 0.45372410 0.14887461 -0.23882157 -0.08974238
## 13 14 15 16 17 18
## 0.66185660 -0.05211511 -0.62726647 -0.74714947 0.42113502 -0.03727441
## 19 20 21 22 23 24
## 0.10685278 -0.78450270 -0.64017590 -0.05508720 -0.67055321 -0.22040381
## 25
## 0.55866518
SSE = sum(model1$residuals^2)
SSE
## [1] 5.734875
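The SSE just computed can be cross-checked against the summary output: the residual standard error is the square root of SSE divided by the residual degrees of freedom. A quick sketch using the numbers above:

```r
# Sketch: relating SSE to the residual standard error from summary(model1).
SSE <- 5.734875   # sum of squared residuals computed above
df  <- 23         # residual degrees of freedom: 25 obs - 2 coefficients
RSE <- sqrt(SSE / df)
round(RSE, 4)     # ~0.4993, the "Residual standard error" line in the summary
```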
# Linear Regression (two variables)
# To add another independent variable, use a "+" sign followed by the desired independent variable.
model2 = lm(Price ~ AGST + HarvestRain, data=wine)
summary(model2)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88321 -0.19600 0.06178 0.15379 0.59722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265 1.85443 -1.188 0.247585
## AGST 0.60262 0.11128 5.415 1.94e-05 ***
## HarvestRain -0.00457 0.00101 -4.525 0.000167 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808
## F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06
# Interpretation of our model:
# (New) 3rd row : "HarvestRain" is added to our model.
# Coefficient estimate := -0.00457
# Lastly, both "Multiple R-squared" & "Adjusted R-squared" have improved.
# This means adding the new independent variable has helped our model.
# (Conclusion) : The new model is "better" than our previous model.
# Sum of Squared Errors
SSE = sum(model2$residuals^2)
SSE
## [1] 2.970373
# (Inference):
# The new SSE is an improvement over the old SSE.
#=============================================================#
# Linear Regression (all variables)
model3 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
summary(model3)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age +
## FrancePop, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48179 -0.24662 -0.00726 0.22012 0.51987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.504e-01 1.019e+01 -0.044 0.965202
## AGST 6.012e-01 1.030e-01 5.836 1.27e-05 ***
## HarvestRain -3.958e-03 8.751e-04 -4.523 0.000233 ***
## WinterRain 1.043e-03 5.310e-04 1.963 0.064416 .
## Age 5.847e-04 7.900e-02 0.007 0.994172
## FrancePop -4.953e-05 1.667e-04 -0.297 0.769578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared: 0.8294, Adjusted R-squared: 0.7845
## F-statistic: 18.47 on 5 and 19 DF, p-value: 1.044e-06
# (Inference):
# We've got 3 new rows, with 3 new independent variables.
# Our residual standard error has decreased slightly from the former model.
# Multiple R-squared & Adjusted R-squared have increased, courtesy of the new variables added to the model.
# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
## [1] 1.732113
# (Inference):
# Our SSE has improved significantly over our previous model!
# [Quiz]: Create a linear regression model to predict Price using "HarvestRain"
# & "WinterRain" as independent variables, and answer the following questions:
model4 <- lm(Price ~ HarvestRain + WinterRain, data = wine)
summary(model4)
##
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0933 -0.3222 -0.1012 0.3871 1.1877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.865e+00 6.616e-01 11.888 4.76e-11 ***
## HarvestRain -4.971e-03 1.601e-03 -3.105 0.00516 **
## WinterRain -9.848e-05 9.007e-04 -0.109 0.91392
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared: 0.3177, Adjusted R-squared: 0.2557
## F-statistic: 5.122 on 2 and 22 DF, p-value: 0.01492
#==================
# Video 5 : Understanding the model and coefficients
# 1st column: (The Estimate)
# - Gives the coefficient estimate for the intercept, & for each of the
#   independent variables in our model.
# - A coefficient of 0 --> the variable has no effect on the model.
# - If a coefficient is not significantly different from 0 --> consider
#   removing that variable from the model,
#   since it is not contributing to the performance of the model.
# 2nd column: (The Standard Error)
# - Measures --> how much the coefficient is likely to vary from the estimate.
# 3rd column: (The t value)
# - Measure --> Estimate / Std. Error
# - Negative if the estimate is negative, & vice-versa.
# - The larger the |t value|, the more likely the coefficient is significant.
# - Thus we want independent variables with a large |t value| in this column.
# 4th column: Pr(>|t|)
# - Measures --> how plausible it is that the coefficient is really 0, given the data.
# - The smaller it is, the less likely our coefficient estimate is 0.
# - This number is large if |t value| is small, & vice-versa.
# - We want independent variables with a small value in this column.
# [Note] : The easiest way to determine if a variable is significant is to
# look at the symbols at the end of each row:
# - *** : Highest level of significance, p-value < 0.001
# - **  : Very significant, p-value between 0.001 & 0.01
# - *   : Significant, p-value between 0.01 & 0.05
# - A dot (.) : Almost significant, p-value between 0.05 & 0.10
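To connect the 3rd and 4th columns: the Pr(>|t|) entry is the two-sided tail probability of the t value under a t-distribution with the model's residual degrees of freedom. A sketch, using the AGST values printed in summary(model1):

```r
# Sketch: reproducing the Pr(>|t|) column from the t value (values from model1).
t_value <- 4.208   # AGST t value from summary(model1)
df      <- 23      # residual degrees of freedom
p <- 2 * pt(-abs(t_value), df)   # two-sided tail probability
round(p, 6)                      # close to the 0.000335 shown for AGST
```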
# If we look at our coefficients table, Age & FrancePop are both insignificant.
# Let's start by removing FrancePop from our model (keeping Age for now).
model5 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
summary(model5)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45470 -0.24273 0.00752 0.19773 0.53637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4299802 1.7658975 -1.942 0.066311 .
## AGST 0.6072093 0.0987022 6.152 5.2e-06 ***
## HarvestRain -0.0039715 0.0008538 -4.652 0.000154 ***
## WinterRain 0.0010755 0.0005073 2.120 0.046694 *
## Age 0.0239308 0.0080969 2.956 0.007819 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared: 0.8286, Adjusted R-squared: 0.7943
## F-statistic: 24.17 on 4 and 20 DF, p-value: 2.036e-07
# Adjusted R-squared is slightly improved.
# Interestingly, Age, which was insignificant earlier, has now become significant.
# Why is that so?
# Answer: It's due to multicollinearity, i.e. Age & FrancePop are highly correlated.
# To check our assumption, let's build another model and test this claim.
summary(wine)
## Year Price WinterRain AGST
## Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98
## 1st Qu.:1960 1st Qu.:6.519 1st Qu.:536.0 1st Qu.:16.20
## Median :1966 Median :7.121 Median :600.0 Median :16.53
## Mean :1966 Mean :7.067 Mean :605.3 Mean :16.51
## 3rd Qu.:1972 3rd Qu.:7.495 3rd Qu.:697.0 3rd Qu.:17.07
## Max. :1978 Max. :8.494 Max. :830.0 Max. :17.65
## HarvestRain Age FrancePop
## Min. : 38.0 Min. : 5.0 Min. :43184
## 1st Qu.: 89.0 1st Qu.:11.0 1st Qu.:46584
## Median :130.0 Median :17.0 Median :50255
## Mean :148.6 Mean :17.2 Mean :49694
## 3rd Qu.:187.0 3rd Qu.:23.0 3rd Qu.:52894
## Max. :292.0 Max. :31.0 Max. :54602
model6 <- lm(Price ~ AGST + HarvestRain + WinterRain + FrancePop, data = wine)
summary(model6)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + FrancePop,
## data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48252 -0.24636 -0.00699 0.22089 0.51949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.768e-01 2.180e+00 -0.173 0.864529
## AGST 6.011e-01 9.898e-02 6.073 6.17e-06 ***
## HarvestRain -3.958e-03 8.518e-04 -4.646 0.000156 ***
## WinterRain 1.042e-03 5.070e-04 2.055 0.053202 .
## FrancePop -5.075e-05 1.704e-05 -2.978 0.007434 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2943 on 20 degrees of freedom
## Multiple R-squared: 0.8294, Adjusted R-squared: 0.7952
## F-statistic: 24.3 on 4 and 20 DF, p-value: 1.945e-07
# It looks like our claim is true.
Correlation & Multicollinearity
Correlation is a measure of the linear relationship b/w variables.
Multicollinearity --> two independent variables are highly correlated.
+1 := perfect positive linear relationship.
0 := no linear relationship.
-1 := perfect negative linear relationship.
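As a minimal self-contained sketch (toy vectors, not the wine data) of what cor() computes: the covariance of the two variables divided by the product of their standard deviations.

```r
# Sketch: correlation by hand, cor(x, y) = cov(x, y) / (sd(x) * sd(y)).
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
manual  <- cov(x, y) / (sd(x) * sd(y))
builtin <- cor(x, y)
all.equal(manual, builtin)   # TRUE: both ~0.775, a fairly strong positive relationship
```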
cor(wine$WinterRain, wine$Price)
## [1] 0.1366505
cor(wine$Age, wine$FrancePop)
## [1] -0.9944851
cor(wine)
## Year Price WinterRain AGST HarvestRain
## Year 1.00000000 -0.4477679 0.016970024 -0.24691585 0.02800907
## Price -0.44776786 1.0000000 0.136650547 0.65956286 -0.56332190
## WinterRain 0.01697002 0.1366505 1.000000000 -0.32109061 -0.27544085
## AGST -0.24691585 0.6595629 -0.321090611 1.00000000 -0.06449593
## HarvestRain 0.02800907 -0.5633219 -0.275440854 -0.06449593 1.00000000
## Age -1.00000000 0.4477679 -0.016970024 0.24691585 -0.02800907
## FrancePop 0.99448510 -0.4668616 -0.001621627 -0.25916227 0.04126439
## Age FrancePop
## Year -1.00000000 0.994485097
## Price 0.44776786 -0.466861641
## WinterRain -0.01697002 -0.001621627
## AGST 0.24691585 -0.259162274
## HarvestRain -0.02800907 0.041264394
## Age 1.00000000 -0.994485097
## FrancePop -0.99448510 1.000000000
model7 <- lm(Price ~ AGST + HarvestRain + WinterRain, data = wine)
summary(model7)
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain,
## data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67472 -0.12958 0.01973 0.20751 0.63846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.3016263 2.0366743 -2.112 0.046831 *
## AGST 0.6810242 0.1117011 6.097 4.75e-06 ***
## HarvestRain -0.0039481 0.0009987 -3.953 0.000726 ***
## WinterRain 0.0011765 0.0005920 1.987 0.060097 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared: 0.7537, Adjusted R-squared: 0.7185
## F-statistic: 21.42 on 3 and 21 DF, p-value: 1.359e-06
Inference :
It turns out that our model looks pretty much the same, except for one thing!
Our R-squared dropped to 0.75, vs 0.83 for the model that kept Age.
The question is: should we keep Age or FrancePop?
Removing Age & FrancePop at the same time caused us to lose a significant variable,
and it lowered our R-squared value.
Conclusion: It turns out it's better to keep Age rather than FrancePop, as intuitively "older wines are typically more expensive", so Age makes more sense.
Typically, a correlation > 0.7 or < -0.7 is cause for concern.
Making Predictions
Let's jump in!
# We have two data points that we haven't used to build our model.
# These are contained in the file "wine_test.csv".
wineTest <- read.csv("C:/Users/pySag/Desktop/MITx15.071x/Datasets/wine_test.csv")
str(wineTest)
## 'data.frame': 2 obs. of 7 variables:
## $ Year : int 1979 1980
## $ Price : num 6.95 6.5
## $ WinterRain : int 717 578
## $ AGST : num 16.2 16
## $ HarvestRain: int 122 74
## $ Age : int 4 3
## $ FrancePop : num 54836 55110
summary(wineTest)
## Year Price WinterRain AGST
## Min. :1979 Min. :6.498 Min. :578.0 Min. :16.00
## 1st Qu.:1979 1st Qu.:6.612 1st Qu.:612.8 1st Qu.:16.04
## Median :1980 Median :6.726 Median :647.5 Median :16.08
## Mean :1980 Mean :6.726 Mean :647.5 Mean :16.08
## 3rd Qu.:1980 3rd Qu.:6.840 3rd Qu.:682.2 3rd Qu.:16.13
## Max. :1980 Max. :6.954 Max. :717.0 Max. :16.17
## HarvestRain Age FrancePop
## Min. : 74 Min. :3.00 Min. :54836
## 1st Qu.: 86 1st Qu.:3.25 1st Qu.:54904
## Median : 98 Median :3.50 Median :54973
## Mean : 98 Mean :3.50 Mean :54973
## 3rd Qu.:110 3rd Qu.:3.75 3rd Qu.:55042
## Max. :122 Max. :4.00 Max. :55110
# To make predictions for these two test points, R offers the predict() function.
# We are going to use our previous model no. 5.
predictTest <- predict(model5, newdata = wineTest)
predictTest
## 1 2
## 6.768925 6.684910
Inference : Let’s understand what our output is trying to convey.
The predicted price values closely match the price values of the test data.
But we can quantify this by computing the R-squared value for our test set.
The formula for R-squared is 1 - SSE / SST.
SSE <- sum((wineTest$Price - predictTest)^2)
SST <- sum((wineTest$Price - mean(wine$Price))^2)
# R ^ 2
1 - SSE / SST
## [1] 0.7944278
# Are we done? Is that the best we can do?
# It turns out that our test data is very small; we need more data if we want
# a reliable estimate of "out-of-sample" accuracy for our predictions.
Out-of-Sample R^2
A model’s R-squared increases as we add more independent variables.
However, for the test-set R-squared this is not necessarily true.
Further, we need more test data to be conclusive.
Note also that out-of-sample R-squared can be negative.
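A tiny illustration of that last point, using made-up numbers rather than the wine data: when a model's test-set predictions are worse than simply predicting the training-set mean, SSE exceeds SST and the out-of-sample R-squared goes negative.

```r
# Sketch: negative out-of-sample R-squared (synthetic numbers, not the wine data).
actual    <- c(5, 6)     # hypothetical test-set outcomes
predicted <- c(9, 2)     # deliberately bad model predictions
baseline  <- 5.5         # stand-in for the training-set mean
SSE <- sum((actual - predicted)^2)   # 16 + 16 = 32
SST <- sum((actual - baseline)^2)    # 0.25 + 0.25 = 0.5
1 - SSE / SST                        # 1 - 64 = -63: negative R-squared
```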